{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# 06 文档加载器\n",
    "许多LLM应用程序需要用户特定的数据，这些数据不属于模型的训练集。\n",
    "\n",
    "实现此目的的主要方法是通过检索增强生成 （RAG）。在此过程中，检索外部数据，然后在执行生成步骤时传递给LLM。\n",
    "\n",
    "LangChain为RAG应用程序提供了所有构建块 - 从简单到复杂。文档的这一部分涵盖了与检索步骤相关的所有内容 - 例如数据的获取。\n",
    "\n",
    "虽然这听起来很简单，但它可能非常复杂。这包括几个关键模块。"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "从许多不同的来源加载文档。LangChain提供了100多种不同的文档加载器，以及与该领域其他主要提供商的集成，如AirByte和Unstructured。\n",
    "\n",
    "我们提供集成，从所有类型的位置（私有 s3 存储桶、公共网站）加载所有类型的文档（html、PDF、代码）。"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "使用文档加载器以 的形式从 Document 源加载数据。\n",
    "\n",
    "A Document 是一段文本和关联的元数据。\n",
    "\n",
    "例如，有用于加载简单 .txt 文件、加载任何网页的文本内容甚至加载 YouTube 视频脚本的文档加载器。\n",
    "\n",
    "文档加载程序公开一个“加载”方法，用于将数据作为文档从配置的源加载。它们还可以选择实现“延迟加载”，以便将数据延迟加载到内存中。"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [
    {
     "data": {
      "text/plain": "[Document(page_content='## test index.md', metadata={'source': 'documentstore/index.md'})]"
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "\n",
    "loader = TextLoader(\"documentstore/index.md\")\n",
    "loader.load()"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T03:26:27.866831Z",
     "start_time": "2023-08-26T03:26:18.928507500Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [],
   "source": [
    "# 加载 CSV 数据，每个文档一行。\n",
    "from langchain.document_loaders.csv_loader import CSVLoader\n",
    "\n",
    "\n",
    "loader = CSVLoader(file_path='documentstore/index.csv')\n",
    "data = loader.load()"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T03:28:19.873213Z",
     "start_time": "2023-08-26T03:28:19.865208500Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(page_content='title: red\\ncontext: is a color', metadata={'source': 'documentstore/index.csv', 'row': 0}), Document(page_content='title: watermelon\\ncontext: is a fruit', metadata={'source': 'documentstore/index.csv', 'row': 1}), Document(page_content='title: bike\\ncontext: is a vehicle', metadata={'source': 'documentstore/index.csv', 'row': 2})]\n"
     ]
    }
   ],
   "source": [
    "print(data)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T03:28:27.472272100Z",
     "start_time": "2023-08-26T03:28:27.415811300Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(page_content='title: title\\ncontext: context', metadata={'source': 'documentstore/index.csv', 'row': 0}), Document(page_content='title: red\\ncontext: is a color', metadata={'source': 'documentstore/index.csv', 'row': 1}), Document(page_content='title: watermelon\\ncontext: is a fruit', metadata={'source': 'documentstore/index.csv', 'row': 2}), Document(page_content='title: bike\\ncontext: is a vehicle', metadata={'source': 'documentstore/index.csv', 'row': 3})]\n"
     ]
    }
   ],
   "source": [
    "# 自定义 csv 解析和加载\n",
    "loader = CSVLoader(file_path='documentstore/index.csv', csv_args={\n",
    "    'delimiter': ',',\n",
    "    'quotechar': '\"',\n",
    "    'fieldnames': ['title','context']\n",
    "})\n",
    "\n",
    "data = loader.load()\n",
    "print(data)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T03:30:31.839252200Z",
     "start_time": "2023-08-26T03:30:31.819257Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(page_content='title: red\\ncontext: is a color', metadata={'source': 'is a color', 'row': 0}), Document(page_content='title: watermelon\\ncontext: is a fruit', metadata={'source': 'is a fruit', 'row': 1}), Document(page_content='title: bike\\ncontext: is a vehicle', metadata={'source': 'is a vehicle', 'row': 2})]\n"
     ]
    }
   ],
   "source": [
    "# 使用该 source_column 参数为从每一行创建的文档指定源。否则 file_path 将用作从 CSV 文件创建的所有文档的源。\n",
    "loader = CSVLoader(file_path='documentstore/index.csv', source_column=\"context\")\n",
    "\n",
    "data = loader.load()\n",
    "\n",
    "print(data)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T03:31:28.699257100Z",
     "start_time": "2023-08-26T03:31:28.656220500Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: unstructured[md] in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (0.10.6)\n",
      "Requirement already satisfied: requests in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (2.31.0)\n",
      "Requirement already satisfied: beautifulsoup4 in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (4.12.2)\n",
      "Requirement already satisfied: tabulate in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (0.9.0)\n",
      "Requirement already satisfied: chardet in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (5.2.0)\n",
      "Requirement already satisfied: nltk in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (3.8.1)\n",
      "Requirement already satisfied: filetype in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (1.2.0)\n",
      "Requirement already satisfied: lxml in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (4.9.3)\n",
      "Requirement already satisfied: python-magic in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (0.4.27)\n",
      "Requirement already satisfied: emoji in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from unstructured[md]) (2.8.0)\n",
      "Collecting markdown\n",
      "  Downloading Markdown-3.4.4-py3-none-any.whl (94 kB)\n",
      "     -------------------------------------- 94.2/94.2 kB 335.7 kB/s eta 0:00:00\n",
      "Requirement already satisfied: soupsieve>1.2 in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from beautifulsoup4->unstructured[md]) (2.4.1)\n",
      "Requirement already satisfied: tqdm in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from nltk->unstructured[md]) (4.65.0)\n",
      "Requirement already satisfied: joblib in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from nltk->unstructured[md]) (1.3.2)\n",
      "Requirement already satisfied: click in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from nltk->unstructured[md]) (8.1.6)\n",
      "Requirement already satisfied: regex>=2021.8.3 in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from nltk->unstructured[md]) (2023.6.3)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from requests->unstructured[md]) (2.0.4)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from requests->unstructured[md]) (3.2.0)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from requests->unstructured[md]) (2023.7.22)\n",
      "Requirement already satisfied: idna<4,>=2.5 in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from requests->unstructured[md]) (3.4)\n",
      "Requirement already satisfied: colorama in e:\\pycharmproject\\langchainstudyproject\\venv\\lib\\site-packages (from click->nltk->unstructured[md]) (0.4.6)\n",
      "Installing collected packages: markdown\n",
      "Successfully installed markdown-3.4.4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "[notice] A new release of pip available: 22.3.1 -> 23.2.1\n",
      "[notice] To update, run: python.exe -m pip install --upgrade pip\n"
     ]
    }
   ],
   "source": [
    "! pip install unstructured[md]"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T06:02:11.913563200Z",
     "start_time": "2023-08-26T06:02:08.299896700Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "data": {
      "text/plain": "1"
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 从文件夹加载所有文档\n",
    "from langchain.document_loaders import DirectoryLoader\n",
    "#我们可以使用该 glob 参数来控制要加载的文件。请注意，此处它不会加载 .rst 文件或 .html 文件。\n",
    "loader = DirectoryLoader('E:\\\\pycharmproject\\\\LangChainStudyProject\\\\LangChainStudyProject\\\\documentstore', glob='**/*.md')\n",
    "docs = loader.load()\n",
    "len(docs)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T06:38:43.691071400Z",
     "start_time": "2023-08-26T06:38:43.655954400Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 1/1 [00:00<00:00, 329.30it/s]\n"
     ]
    }
   ],
   "source": [
    "#显示进度条\n",
    "from langchain.document_loaders import DirectoryLoader\n",
    "loader = DirectoryLoader('E:\\\\pycharmproject\\\\LangChainStudyProject\\\\LangChainStudyProject\\\\documentstore', glob=\"**/*.md\", show_progress=True)\n",
    "docs = loader.load()"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T06:39:31.234411100Z",
     "start_time": "2023-08-26T06:39:31.185041Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "outputs": [
    {
     "data": {
      "text/plain": "1"
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#使用多线程\n",
    "loader = DirectoryLoader('E:\\\\pycharmproject\\\\LangChainStudyProject\\\\LangChainStudyProject\\\\documentstore', glob=\"**/*.md\", use_multithreading=True)\n",
    "docs = loader.load()\n",
    "len(docs)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T06:39:23.958486800Z",
     "start_time": "2023-08-26T06:39:23.932305Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "outputs": [
    {
     "data": {
      "text/plain": "1"
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 更改加载程序类\n",
    "from langchain.document_loaders import PythonLoader\n",
    "loader = DirectoryLoader('E:\\\\pycharmproject\\\\LangChainStudyProject\\\\LangChainStudyProject', glob=\"*.py\", loader_cls=PythonLoader)\n",
    "docs = loader.load()\n",
    "len(docs)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T06:39:08.593974500Z",
     "start_time": "2023-08-26T06:39:08.538980900Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "outputs": [
    {
     "data": {
      "text/plain": "[Document(page_content='html\\n\\ntest', metadata={'source': 'documentstore/fake-content.html'})]"
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#html\n",
    "from langchain.document_loaders import UnstructuredHTMLLoader\n",
    "loader = UnstructuredHTMLLoader(\"documentstore/fake-content.html\")\n",
    "data = loader.load()\n",
    "data"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T06:42:30.960137700Z",
     "start_time": "2023-08-26T06:42:30.943069Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "outputs": [
    {
     "data": {
      "text/plain": "[Document(page_content='\\n\\nhtml\\ntest\\n\\n', metadata={'source': 'documentstore/fake-content.html', 'title': ''})]"
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 使用 BeautifulSoup4 加载 HTML 4\n",
    "from langchain.document_loaders import BSHTMLLoader\n",
    "loader = BSHTMLLoader(\"documentstore/fake-content.html\")\n",
    "data = loader.load()\n",
    "data"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T06:43:10.964527800Z",
     "start_time": "2023-08-26T06:43:08.814827300Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},\n",
      " 'is_still_participant': True,\n",
      " 'joinable_mode': {'link': '', 'mode': 1},\n",
      " 'magic_words': [],\n",
      " 'messages': [{'content': 'Bye!',\n",
      "               'sender_name': 'User 2',\n",
      "               'timestamp_ms': 1675597571851},\n",
      "              {'content': 'Hi! Im interested in your bag. Im offering $50. Let '\n",
      "                          'me know if you are interested. Thanks!',\n",
      "               'sender_name': 'User 1',\n",
      "               'timestamp_ms': 1675549022673}],\n",
      " 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],\n",
      " 'thread_path': 'inbox/User 1 and User 2 chat',\n",
      " 'title': 'User 1 and User 2 chat'}\n"
     ]
    }
   ],
   "source": [
    "from langchain.document_loaders import JSONLoader\n",
    "import json\n",
    "from pathlib import Path\n",
    "from pprint import pprint\n",
    "\n",
    "\n",
    "file_path='./documentstore/examples.json'\n",
    "data = json.loads(Path(file_path).read_text())\n",
    "pprint(data)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2023-08-26T07:35:44.659715300Z",
     "start_time": "2023-08-26T07:35:42.494134300Z"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
